In the present paper, we tackle the problem of the compact and efficient representation of restricted lexical co-occurrence information in the lexicon along semantic lines. The theoretical framework for this study is the Meaning-Text Theory (MTT) and, more specifically, the lexicographic part of MTT --- the Explanatory Combinatorial Dictionary (ECD), which contains for each lexeme (i) its semantic definition, (ii) a systematic description of its restricted lexical co-occurrence in terms of Lexical Functions (LFs), and (iii) its Government Pattern. The data domain is the semantic field of emotion lexemes in German. In order to represent the restricted lexical co-occurrence (or collocations) of the lexemes in this field, we suggest the following procedure: 1. Construct approximate descriptions of their meaning, i.e. what we call abridged lexicographic definitions. Formulated in terms of semantic features, these definitions are supposed to provide as much semantic information as is necessary for establishing correlations between the semantic features of a lexeme and its collocates. 2. Specify their syntactic Government Patterns, which are needed for a clearer picture of their co-occurrence --- syntactic as well as lexical. 3. Specify their restricted lexical co-occurrence with the verbs chosen. 4. Establish correlations between the values of LFs and the semantic features in the abridged definitions of the emotion lexemes. 5. Based on these correlations, extract recurrent values of LFs (and recurrent Government Patterns) from individual lexical entries and list them under what we call the generic lexeme of the semantic field under study --- in this case, GEFÜHL 'emotion'. This leads, on the one hand, to "compressed" lexical entries for emotion lexemes and, on the other hand, to the creation of a lexical entry of a new type: the "public" entry of a generic lexeme.
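The compression described in step 5 can be sketched as a simple inheritance lookup: recurrent LF values live in the "public" entry of the generic lexeme (here GEFÜHL), and a "compressed" individual entry overrides them only where its collocates deviate. A minimal sketch, assuming invented lexemes and LF values --- these are illustrative stand-ins, not actual ECD data:

```python
# Sketch of LF-value inheritance between individual and generic entries.
# All lexemes and LF values below are invented for illustration.

# "Public" entry of the generic lexeme: recurrent LF values shared by
# the whole semantic field of emotion lexemes.
GENERIC = {
    "GEFÜHL": {
        "Oper1": ["empfinden", "haben"],   # 'to feel / have (an emotion)'
        "Magn":  ["tief", "groß"],         # 'deep / great'
    }
}

# "Compressed" individual entries: only idiosyncratic LF values remain.
INDIVIDUAL = {
    "ANGST":  {"Oper1": ["haben"]},        # overrides the generic value
    "FREUDE": {},                          # inherits everything
}

def lf_value(lexeme: str, lf: str, generic: str = "GEFÜHL"):
    """Resolve an LF value: individual entry first, then the generic one."""
    entry = INDIVIDUAL.get(lexeme, {})
    if lf in entry:
        return entry[lf]
    return GENERIC[generic].get(lf)

print(lf_value("ANGST", "Oper1"))   # individual value
print(lf_value("FREUDE", "Magn"))   # inherited from GEFÜHL
```

The design point is that the generic entry is consulted only on a miss, so deleting a recurrent value from an individual entry does not lose information --- it merely defers to the public entry.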
Keywords: lexicography, lexicon, German emotion lexemes, lexical co-occurrence, collocations, Meaning-Text Theory, lexical functions, semantic features, semantico-lexical correlations, information extraction, inheritance, individual lexical subentry, public lexical subentry ; Lexical combinations and lexical inheritance. Emotion lexemes in German: a lexicographic case study
Paper presented at: 10th Workshop on Building and Using Comparable Corpora (BUCC), held in Vancouver, Canada, July 30 - August 4, 2017. ; This paper presents a methodology to extract parallel speech corpora for any language pair from dubbed movies, together with an application framework in which some corresponding prosodic parameters are extracted. The obtained parallel corpora are especially suitable for speech-to-speech translation applications when a prosody transfer between source and target languages is desired. ; This work is part of the KRISTINA project, which has received funding from the European Union's Horizon 2020 Research and Innovation Programme under the Grant Agreement number 645012. The second author is partially funded by the Spanish Ministry of Economy, Industry and Competitiveness through the Ramón y Cajal program.
Paper presented at: The 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), held in Stockholm, Sweden, August 20-24, 2017. ; This paper presents an open-source tool that has been developed to visualize a speech corpus with its transcript and prosodic features aligned at word level. In particular, the tool is aimed at providing a simple and clear way to visualize prosodic patterns along large segments of speech corpora, and can be applied in any research that involves prosody analysis. ; This work is part of the KRISTINA project, which has received funding from the European Union's Horizon 2020 Research and Innovation Programme under the Grant Agreement number 645012. The second author is partially funded by the Spanish Ministry of Economy, Industry and Competitiveness through the Ramón y Cajal program.
Until very recently, the generation of punctuation marks for automatic speech recognition (ASR) output has mostly been done by looking at the syntactic structure of the recognized utterances. Prosodic cues such as breaks, speech rate and pitch intonation, which influence the placement of punctuation marks in speech transcripts, have seldom been used. We propose a method that uses recurrent neural networks, taking prosodic and lexical information into account in order to predict punctuation marks for raw ASR output. Our experiments show that an attention mechanism over parallel sequences of prosodic cues aligned with transcribed speech improves the accuracy of punctuation generation. ; We would like to thank Francesco Barbieri for offering his technical insights throughout this work. This work is part of the KRISTINA project, which has received funding from the European Union's Horizon 2020 Research and Innovation Programme under the Grant Agreement number H2020-RIA-645012. The second author is partially funded by the Spanish Ministry of Economy, Industry and Competitiveness through the Ramón y Cajal program.
In this paper, we present an approach to the automatic extraction of conceptual structures from unorganized collections of documents using large-scale lexical regularities in text. The technique maps a term to a constellation of other terms that captures the essential meaning of the term in question. The methodology is language-independent: it involves the exploration of a document collection in which the initial term occurs (e.g., the collection returned by a search engine when queried with this term) and the building of a network in which each node is assigned to a term. The weights of the connections between nodes are strengthened each time the terms that these nodes represent appear together in a context of a predefined length. Possible applications are automatic concept map generation, terminology extraction, term retrieval, term translation, term localization, etc.
The system is currently under development, although preliminary experiments show promising results. ; This paper was supported by the ADQUA scholarship granted to the first author by the Government of Catalonia, Spain, according to the resolution UNI/772/2003.
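The core of the network construction can be sketched with a sliding window over tokenized text: every time two distinct terms co-occur within the window, the weight of the edge between their nodes is incremented. A minimal sketch assuming whitespace tokenization and a fixed window size --- both are simplifications of the actual system:

```python
from collections import Counter

def cooccurrence_network(tokens, window=3):
    """Edge weights: +1 each time two distinct terms co-occur within
    `window` tokens of each other (undirected, order-normalized keys)."""
    edges = Counter()
    for i, t in enumerate(tokens):
        for u in tokens[i + 1 : i + window]:
            if u != t:
                edges[tuple(sorted((t, u)))] += 1
    return edges

# Toy "document": in practice the tokens come from a whole collection.
text = "term maps term to constellation of term meanings".split()
net = cooccurrence_network(text, window=3)
print(net.most_common(3))  # strongest connections in this toy text
```

Sorting each edge key makes the network undirected, and repeated co-occurrence accumulates weight, which is what lets frequent neighbours of a term emerge as its "constellation".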
Paper presented at: The 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), held in Stockholm, Sweden, August 20-24, 2017. ; This paper presents a demonstration of a stochastic prosody tool for enrichment of synthesized speech using SSML prosody tags applied over hierarchical thematicity spans in the context of a CTS application. The motivation for using hierarchical thematicity is exemplified, together with the capabilities of the module to generate a variety of SSML prosody tags within a controlled range of values depending on the input thematicity label. ; This work is part of the KRISTINA project, which has received funding from the European Union's Horizon 2020 Research and Innovation Programme under the Grant Agreement number H2020-RIA-645012. It has also been partly supported by the Spanish Ministry of Economy and Competitiveness under the María de Maeztu Units of Excellence Programme (MDM-2015-0502). The second author is partially funded by the Ramón y Cajal program.
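The behaviour of such a module can be sketched as a mapping from thematicity labels to value ranges, from which an SSML `<prosody>` tag is sampled. The labels and ranges below are invented for illustration and are not those of the actual module:

```python
import random

# Illustrative per-label ranges: speaking rate (%) and pitch shift
# (semitones). Real ranges would be tuned on annotated speech data.
RANGES = {
    "theme": {"rate": (90, 100),  "pitch": (-2.0, 0.0)},
    "rheme": {"rate": (100, 115), "pitch": (0.0, 3.0)},
}

def ssml_prosody(span: str, label: str, rng: random.Random) -> str:
    """Wrap a thematicity span in an SSML <prosody> tag with values
    sampled from the label's controlled range."""
    r = RANGES[label]
    rate = rng.uniform(*r["rate"])
    pitch = rng.uniform(*r["pitch"])
    return f'<prosody rate="{rate:.0f}%" pitch="{pitch:+.1f}st">{span}</prosody>'

rng = random.Random(0)  # seeded for reproducible sampling
print(ssml_prosody("the patient", "theme", rng))
print(ssml_prosody("needs rest", "rheme", rng))
```

Sampling within a controlled range, rather than emitting a fixed value per label, is what gives the output variety without letting the prosody drift outside plausible bounds.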
Paper presented at Speech Prosody 8, 2016 May 31 - Jun 3; Boston, United States. ; Intonation is traditionally considered to be the most important prosodic feature, and an important research effort has therefore been devoted to the automatic segmentation and labeling of speech samples to grasp intonation cues. A number of studies also show that when duration or intensity is incorporated, automatic prosody labeling is further improved. However, the combination of word-level acoustic features still attains poor results when machine learning techniques are applied to annotated corpora to derive intonation for speech synthesis applications. To address this problem, we present an experimental set-up for the development of a hierarchical prosodic structure model which combines linguistic features, including information structure, and three acoustic elements (intensity, pitch and duration). We show empirically that this combination leads to a considerably more accurate representation of prosody and, consequently, a more reliable automatic labeling of speech corpora for machine learning. ; This work is part of a project that has received funding from the European Union's Horizon 2020 Research and Innovation Programme under the Grant Agreement number H2020-RIA-645012. The second author is partially funded by a grant from the Spanish Ministry of Economy and Competitiveness in the framework of the Juan de la Cierva fellowship program.
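Word-level acoustic features of the kind combined here can be illustrated by aggregating frame-level measurements over word boundaries obtained from forced alignment. A minimal sketch --- the frames, boundaries and feature set below are invented, and real pipelines use dedicated acoustic analysis tools:

```python
# Aggregate frame-level acoustic measurements into word-level features.
# Frame values and word boundaries are invented for illustration.

# 10 ms frames: (time in s, F0 in Hz, intensity in dB)
frames = [(t * 0.01, 120 + t, 60 + t % 5) for t in range(100)]
# Word boundaries from a (hypothetical) forced alignment: (word, start, end)
words = [("the", 0.00, 0.20), ("answer", 0.20, 0.75)]

def word_features(frames, words):
    """Mean pitch, mean intensity and duration per word."""
    feats = {}
    for w, start, end in words:
        sel = [(f0, db) for t, f0, db in frames if start <= t < end]
        f0s = [f0 for f0, _ in sel]
        dbs = [db for _, db in sel]
        feats[w] = {
            "dur": round(end - start, 2),
            "f0": sum(f0s) / len(f0s),
            "db": sum(dbs) / len(dbs),
        }
    return feats

print(word_features(frames, words))
```

Each word thus gets one fixed-length vector (duration, mean F0, mean intensity), which is the level at which the acoustic features are combined with the linguistic ones.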
Paper presented at: 12th Language Resources and Evaluation Conference, held online, May 13-15, 2020. ; This paper introduces ThemePro, a toolkit for the automatic analysis of thematic progression. Thematic progression is relevant to natural language processing (NLP) applications dealing, among other things, with discourse structure, argumentation structure, natural language generation, summarization and topic detection. A web platform demonstrates the potential of this toolkit and provides a visualization of the results including syntactic trees, hierarchical thematicity over propositions and thematic progression over whole texts. ; This work is part of the WELCOME project, which has received funding from the European Union's Horizon 2020 Research and Innovation Programme under the Grant Agreement number 870930.
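The notion of thematic progression analysed by such a toolkit can be illustrated with a minimal classifier over theme/rheme pairs: in constant progression consecutive sentences share a theme, while in simple linear progression the rheme of one sentence becomes the theme of the next. A sketch under strong simplifying assumptions --- themes are plain strings rather than propositions, and the sentences are invented:

```python
def progression_type(pairs):
    """Classify a sequence of (theme, rheme) pairs as 'constant',
    'linear', or 'mixed' thematic progression (simplified)."""
    constant = all(a[0] == b[0] for a, b in zip(pairs, pairs[1:]))
    linear = all(a[1] == b[0] for a, b in zip(pairs, pairs[1:]))
    if constant:
        return "constant"
    if linear:
        return "linear"
    return "mixed"

# Simple linear progression: each rheme is picked up as the next theme.
linear_text = [("the toolkit", "a parser"), ("a parser", "syntactic trees")]
# Constant progression: the theme stays the same across sentences.
constant_text = [("ThemePro", "a toolkit"), ("ThemePro", "a web platform")]

print(progression_type(linear_text))    # linear
print(progression_type(constant_text))  # constant
```

A real analysis has to match themes by semantic identity rather than string equality, but the pattern over pairs is the same.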
Paper presented at Speech Prosody 8, 2016 May 31 - Jun 3; Boston, United States. ; State-of-the-art prosody modelling in content-to-speech (CTS) applications still uses the same methodology to predict intonation cues as text-to-speech (TTS) applications, namely the analysis of the generated surface sentences with respect to part of speech, syntactic dependency relations and word order. On the other hand, several theoretical studies argue that morphology, syntax, and the information (or communicative) structure that organizes a given content (semantic or deep-syntactic structure) with respect to the intention of the speaker show a strong correlation with intonation. However, little empirical work based on sufficiently large corpora has been carried out so far to buttress this argumentation. We present empirical evidence for the Information Structure-Prosody correlation using the Wall Street Journal Penn Treebank corpus recorded by native American English speakers. Our experiments reach a prosody prediction accuracy of 80% using the hierarchical information structure from the Meaning-Text Theory, compared to 59% for the baseline. ; This work is part of a project that has received funding from the European Union's Horizon 2020 Research and Innovation Programme under the Grant Agreement number H2020-RIA-645012. The second author is partially funded by a grant from the Spanish Ministry of Economy and Competitiveness in the framework of the Juan de la Cierva fellowship program.
Paper presented at: The Language Resources and Evaluation Conference, held May 7-12, 2018 in Miyazaki, Japan. ; Theoretical studies on the Information Structure-prosody interface argue that the content packaged in terms of theme and rheme correlates with the intonation of the corresponding sentence. However, there are few empirical studies that support this argument and even fewer resources that promote reproducibility and scalability of experiments. In this paper, we introduce a methodology for the compilation of annotated corpora to study the correspondence between Information Structure and prosody. The application of this methodology is exemplified on a corpus of read speech in English annotated with hierarchical thematicity and automatically extracted prosodic parameters. ; This work is part of the KRISTINA project, which has received funding from the European Union's Horizon 2020 Research and Innovation Programme under the Grant Agreement number H2020-RIA-645012. It has also been partly supported by the Spanish Ministry of Economy and Competitiveness under the María de Maeztu Units of Excellence Programme (MDM-2015-0502), and the third author is partially funded by the Ramón y Cajal program.
Paper presented at the 2015 Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT 2015), held May 31 - June 5, 2015 in Denver (CO, USA). ; "Deep-syntactic" dependency structures bridge the gap between the surface-syntactic structures as produced by state-of-the-art dependency parsers and semantic logical forms in that they abstract away from surface-syntactic idiosyncrasies, but still keep the linguistic structure of a sentence. They thus have great potential for such downstream applications as machine translation and summarization. In this demo paper, we propose an online version of a deep-syntactic parser that outputs deep-syntactic structures from plain sentences and visualizes them using the Brat tool. Along with the deep-syntactic structures, the user can also inspect the visual presentation of the surface-syntactic structures that serve as input to the deep-syntactic parser and that are produced by the joint tagger and syntactic transition-based parser run in the pipeline before deep-syntactic parsing takes place. ; This work has been partially funded by the European Union's Seventh Framework and Horizon 2020 Research and Innovation Programmes under the Grant Agreement numbers FP7-ICT-610411, FP7-SME-606163, and H2020-RIA-645012.
Paper presented at: The 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), held in Stockholm, Sweden, August 20-24, 2017. ; This work aims to explore the correlation between the discourse structure of a spoken monologue and its prosody by predicting discourse relations from different prosodic attributes. For this purpose, a corpus of semi-spontaneous monologues in English has been automatically annotated according to the Rhetorical Structure Theory, which models coherence in text via rhetorical relations. From the corresponding audio files, prosodic features such as pitch, intensity, and speech rate have been extracted from different contexts of a relation. Supervised classification tasks using Support Vector Machines have been performed to find relationships between prosodic features and rhetorical relations. Preliminary results show that intensity combined with other features extracted from intra- and intersegmental environments is the feature with the highest predictability for a discourse relation. The prediction of rhetorical relations from prosodic features and their combinations is straightforwardly applicable to several tasks such as speech understanding or generation. Moreover, the knowledge of how rhetorical relations should be marked in terms of prosody will serve as a basis to improve speech synthesis applications and make voices sound more natural and expressive. ; This work is part of the KRISTINA project, which has received funding from the European Union's Horizon 2020 Research and Innovation Programme under the Grant Agreement number 645012. The second author is partially funded by the Spanish Ministry of Economy, Industry and Competitiveness through the Ramón y Cajal program. The third and fourth authors are partially funded by ANPCYT PICT 2014-1561, and the Air Force Office of Scientific Research, Air Force Material Command, USAF under Award No. FA9550-15-1-0055.
The last decade has seen the rise of research in the area of hate speech and abusive language detection. A great deal of research has been conducted, with further datasets introduced and new models put forward. However, contrastive studies of the annotation of different datasets have also revealed that some problematic issues remain. Ambiguous and inconsistent definitions across studies make it more difficult to evaluate model reproducibility and generalizability and require additional steps for dataset standardization. To overcome these challenges, the field needs a common understanding of concepts and problems so that standard datasets and compatible approaches can be developed, avoiding inefficient and redundant research. This article attempts to identify persistent challenges and develop guidelines to help future annotation tasks. Some of the challenges and guidelines identified and discussed in the article relate to concept subjectivity, the focus on overt hate speech, dataset integrity, and the lack of ethical considerations.